01 - R framework with IMPACT - session 1

Author

Yann Say

Published

April 10, 2024

IMPACT R framework


The IMPACT R framework has been developed to meet the requirements of the IMPACT research cycle.

The ecosystem is a modular framework with two dimensions:

  • a horizontal dimension that focuses on the outcome of a given step, and
  • a vertical dimension that focuses on the content of a given step.

The framework is built around:

  • 4 steps: cleaning, composition, analysis, outputs
  • 4 verbs: check, add, create, review
  • 2 adjectives: pipeable, independent

These elements will help to improve cooperation and collaboration between different teams while allowing modularity to adapt to each context and assessment.


library(cleaningtools)
library(dplyr)

my_raw_dataset <- cleaningtools::cleaningtools_raw_data
my_kobo_survey <- cleaningtools::cleaningtools_survey
my_kobo_choice <- cleaningtools::cleaningtools_choices

Checking a dataset

check_*

check_* functions will flag values based on a specific check. It will return them in a log. A *check_** will return a list: the checked dataset, and the log.

check_* functions are used only in the cleaning step.

check_outliers

my_log1 <- my_raw_dataset %>% 
  check_outliers(uuid_column = "X_uuid")

In this example, there are:

  • checked_dataset: the raw dataset (with extra variables if needed)
  • potential_outliers: a log of potential outliers
typeof(my_log1)
[1] "list"
my_log1 %>% 
  names()
[1] "checked_dataset"    "potential_outliers"

The log has at least 4 columns:

  • uuid: the unique identifier
  • issue: the issue being flagged
  • question: the name of the question
  • old_value: the value being flagged
my_log1$potential_outliers %>% 
  head()
uuid issue question old_value
b5b1d37a-e27a-4c35-a0f5-2cde9d6dfd06 outlier (normal distribution) age_respondent_r 86
956b5ed0-5a62-41b7-aec3-af93fbc5b494 outlier (normal distribution) age_respondent_r 84
97ad6294-30c6-454e-a0b3-42126415b767 outlier (log distribution) age_respondent_r 18
e005e719-57c4-44a3-ac2f-5d6d1ff68831 outlier (log distribution) age_respondent_r 18
c9aaa542-118f-4e42-93de-fb0916572541 outlier (normal distribution) num_hh_member 19
48e8896b-d1be-4600-8839-2d8b994ebcfb outlier (normal distribution) num_hh_member 19
Note

Outliers are defined as +/- 3 standard deviation from the mean.

For log outliers, log(x + 1) is used.

check_duplicate

my_log2 <- my_raw_dataset %>% 
  check_duplicate(uuid_column = "X_uuid")

my_log2$duplicate_log %>% 
  head()
uuid old_value question issue

There is no duplicate. The log is empty.

Pipe-able

The framework is built around 2 adjectives, pipe-able and independent. In the framework, functions of the same family should be pipe-able. In the following case, 2 check_* functions are piped.

my_log3 <- my_raw_dataset %>% 
  check_outliers(uuid_column = "X_uuid") %>% 
  check_duplicate(uuid_column = "X_uuid")
names(my_log3)
[1] "checked_dataset"    "potential_outliers" "duplicate_log"     
my_log3$potential_outliers %>% 
  head()
uuid issue question old_value
b5b1d37a-e27a-4c35-a0f5-2cde9d6dfd06 outlier (normal distribution) age_respondent_r 86
956b5ed0-5a62-41b7-aec3-af93fbc5b494 outlier (normal distribution) age_respondent_r 84
97ad6294-30c6-454e-a0b3-42126415b767 outlier (log distribution) age_respondent_r 18
e005e719-57c4-44a3-ac2f-5d6d1ff68831 outlier (log distribution) age_respondent_r 18
c9aaa542-118f-4e42-93de-fb0916572541 outlier (normal distribution) num_hh_member 19
48e8896b-d1be-4600-8839-2d8b994ebcfb outlier (normal distribution) num_hh_member 19
my_log3$duplicate_log %>% 
  head()
uuid old_value question issue

More checks

This an example of more checks that exist.

more_logs <- my_raw_dataset %>% 
  check_duplicate(uuid_column = "X_uuid") %>% 
  check_soft_duplicates(uuid_column = "X_uuid", kobo_survey = my_kobo_survey, sm_separator = ".") %>%
  check_outliers(uuid_column = "X_uuid") %>%
  check_value(uuid_column = "X_uuid") 
add_*

add_* functions will add a variable (column) to the dataset. For example, to add the duration of a survey, to add the food consumption score category, etc.

add_* function takes a dataset as input and returns the dataset + the new indicator (and any intermediate steps used for the calculation).

For example, to check the duration of a survey, there is only the start and end, but not the duration column.

add_duration

more_logs$checked_dataset <- more_logs$checked_dataset %>% 
  add_duration(uuid_column = "X_uuid", start_column = "X.U.FEFF.start", end_column = "end")
more_logs$checked_dataset[1:6, c("start_date", "start_time", "end_date", "end_time", "days_diff", "duration")]
start_date start_time end_date end_time days_diff duration
2021-07-05 658.57 mins 2021-07-05 696.68 mins 0 days 38.11
2021-07-05 608.90 mins 2021-07-05 641.92 mins 0 days 33.02
2021-07-05 682.23 mins 2021-07-05 726.43 mins 0 days 44.20
2021-07-04 1342.98 mins 2021-07-04 1380.15 mins 0 days 37.17
2021-07-04 1391.62 mins 2021-07-05 18.88 mins 1 days 67.26
2021-07-05 617.38 mins 2021-07-05 756.52 mins 0 days 139.14
Warning

The duration is added to the checked_dataset in the list, not in the my_raw_dataset dataframe. The check_* functions are used in a pipe, so it needs the current dataset to be modified.

Warning

At the moment, add_duration takes very specific format. It will change in the future to become more robust and using lubridate.

check_duration can now be used with the previous checks.

more_logs <- more_logs %>% 
  check_duration(column_to_check = "duration", uuid_column = "X_uuid")

As much as possible, check_* functions take default argument or the functions will be able to guess some information, e.g. the check_outliers function guesses some numerical values. Some functions need more information.

other/text columns

check_other needs the list of columns to be checked. It currently, it cannot detect the open text question. KOBO tool can be used.

other_columns_to_check <- my_kobo_survey %>% 
  dplyr::filter(type == "text") %>% 
  dplyr::filter(name %in% names(my_raw_dataset)) %>%
  dplyr::pull(name) 

more_logs <- more_logs %>% 
  check_others(uuid_column = "X_uuid", columns_to_check = other_columns_to_check) 

check_logical

In other cases, the check is specific and should be tailored to the dataset, for example, check_logical.

example_logic <- my_raw_dataset %>% 
  check_logical(uuid_column = "X_uuid", 
                check_to_perform = "primary_livelihood.employment == 1 & tot_expenses < 200000",
                description = "primary_livelihood is employment but expenses less than 200 000",
                columns_to_clean = c("primary_livelihood", "tot_expenses"))

example_logic$logical_xx %>% 
  head()
uuid question old_value issue check_id check_binding
f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb primary_livelihood employment primary_livelihood is employment but expenses less than 200 000 logical_xx logical_xx / f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb
f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb tot_expenses 125000 primary_livelihood is employment but expenses less than 200 000 logical_xx logical_xx / f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb
e9f8b44c-c507-45a1-8d76-66d886437b8f primary_livelihood employment primary_livelihood is employment but expenses less than 200 000 logical_xx logical_xx / e9f8b44c-c507-45a1-8d76-66d886437b8f
e9f8b44c-c507-45a1-8d76-66d886437b8f tot_expenses 175000 primary_livelihood is employment but expenses less than 200 000 logical_xx logical_xx / e9f8b44c-c507-45a1-8d76-66d886437b8f
994a60b8-e640-425c-9774-160651d7af04 primary_livelihood employment primary_livelihood is employment but expenses less than 200 000 logical_xx logical_xx / 994a60b8-e640-425c-9774-160651d7af04
994a60b8-e640-425c-9774-160651d7af04 tot_expenses 175000 primary_livelihood is employment but expenses less than 200 000 logical_xx logical_xx / 994a60b8-e640-425c-9774-160651d7af04

The log returns :

  • uuid
  • question: for all variables in columns_to_clean
  • old value: for all variables in columns_to_clean
  • issue
  • check_id: logical check identifier
  • check_binding: the combination of the check_id and the uuid.

One check can be flagged in several rows, in the example above, for each uuid, the primary_livelihood and tot_expenses are flagged.

Format for the test_to_perform should take the format based on tidyverse. That format is as if a new indicator is create with a mutate. That new indicator should be a logical (i.e. TRUE or FALSE) with TRUE being the value to flag.

my_raw_dataset %>% 
  dplyr::mutate(xxx =  primary_livelihood == "employment" & tot_expenses < 1000000) %>% 
  dplyr::select(X_uuid, xxx, primary_livelihood, tot_expenses) %>% 
  head()
X_uuid xxx primary_livelihood tot_expenses
dcf2753a-6ea2-40f5-b493-3527931ef96c FALSE loans support 250000
8790ce5c-1c35-41a2-b3c0-538f937d5397 TRUE employment 750000
bb818e04-9c40-408e-919f-6b40ff1fdbb3 FALSE other 250000
28b90cbb-2cf0-41c5-9ee1-1c719c0d4c02 TRUE employment 600000
7f2a0c6a-529b-481f-963f-a96dca2ec034 TRUE employment 500000
b4f92064-12ea-4970-b0f5-fd309de1dda3 FALSE retirement_fund 650000

The checked dataset will be return with one extra column, i.e. the logical variable with the name of the check_id.

example_logic$checked_dataset[1:6,tail(names(example_logic$checked_dataset))]
X_submission_time X_notes X_status X_submitted_by X_index logical_xx
2021-07-05T09:34:15 [] submitted_via_web reach_irq 1 FALSE
2021-07-05T09:34:09 [] submitted_via_web reach_irq 2 FALSE
2021-07-05T09:34:33 [] submitted_via_web reach_irq 3 FALSE
2021-07-05T09:32:58 [] submitted_via_web reach_irq 4 FALSE
2021-07-05T09:32:59 [] submitted_via_web reach_irq 5 FALSE
2021-07-05T09:37:33 [] submitted_via_web reach_irq 6 FALSE

check_to_perform is written as a character, between quotes and using with regular expression, e.g. escaping the quotes in “employment”.

example_logic2 <- my_raw_dataset %>% 
  check_logical(uuid_column = "X_uuid", 
                check_to_perform = "primary_livelihood == \"employment\" & tot_expenses < 1000000",
                description = "primary_livelihood is rented but expenses less than 1 000 000")
Warning in check_logical(., uuid_column = "X_uuid", check_to_perform
= "primary_livelihood == \"employment\" & tot_expenses < 1000000", :
columns_to_clean not shared, results may not be accurate
example_logic2$logical_xx %>% 
  head()
uuid question old_value issue check_id check_binding
8790ce5c-1c35-41a2-b3c0-538f937d5397 primary_livelihood employment primary_livelihood is rented but expenses less than 1 000 000 logical_xx logical_xx / 8790ce5c-1c35-41a2-b3c0-538f937d5397
8790ce5c-1c35-41a2-b3c0-538f937d5397 tot_expenses 750000 primary_livelihood is rented but expenses less than 1 000 000 logical_xx logical_xx / 8790ce5c-1c35-41a2-b3c0-538f937d5397
28b90cbb-2cf0-41c5-9ee1-1c719c0d4c02 primary_livelihood employment primary_livelihood is rented but expenses less than 1 000 000 logical_xx logical_xx / 28b90cbb-2cf0-41c5-9ee1-1c719c0d4c02
28b90cbb-2cf0-41c5-9ee1-1c719c0d4c02 tot_expenses 600000 primary_livelihood is rented but expenses less than 1 000 000 logical_xx logical_xx / 28b90cbb-2cf0-41c5-9ee1-1c719c0d4c02
7f2a0c6a-529b-481f-963f-a96dca2ec034 primary_livelihood employment primary_livelihood is rented but expenses less than 1 000 000 logical_xx logical_xx / 7f2a0c6a-529b-481f-963f-a96dca2ec034
7f2a0c6a-529b-481f-963f-a96dca2ec034 tot_expenses 500000 primary_livelihood is rented but expenses less than 1 000 000 logical_xx logical_xx / 7f2a0c6a-529b-481f-963f-a96dca2ec034
Note

If you don’t include columns_to_clean the check_logical function will try to guess the variables. Not guarantee it will read or pick the correct names.

This inherits from reading a checklist in an Excel format. In Excel, there is no quote.

logical_check_list <- readxl::read_excel("inputs/01 - example - check_list.xlsx")
logical_check_list
check_id description check_to_perform columns_to_clean
check_1 primary_livelihood is employment but expenses less than 200000 primary_livelihood.employment == 1 & tot_expenses < 200000 primary_livelihood.employment,
tot_expenses
check_2 acces water and tank emptied access_water_enough == “totally_insufficient” & tank_emptied == “about_half” access_water_enough, tank_emptied

This list can then be used with check_logical_with_list.

more_logs <- more_logs %>% 
  check_logical_with_list(uuid_column = "X_uuid",
                          list_of_check = logical_check_list,
                          check_id_column = "check_id",
                          check_to_perform_column = "check_to_perform",
                          columns_to_clean_column = "columns_to_clean",
                          description_column = "description")

Exercises

Try the following with a dataset:

  • Perform a check to spot personal identifiable information

Try the function check_pii

my_raw_dataset %>% 
  check_pii(uuid_column = "X_uuid")
  • Perform a check that will look at the percentages of missing value per observation and that will spot any observation that is different.

Try the function check_percentage_missing

Did you try to add a new column with add_percentage_missing

my_raw_dataset <- my_raw_dataset %>% 
  add_percentage_missing(kobo_survey = my_kobo_survey)
  
my_raw_dataset %>% 
  check_percentage_missing(uuid_column = "X_uuid")

Or if using a log already

more_logs$checked_dataset <- more_logs$checked_dataset %>% 
  add_percentage_missing(kobo_survey = my_kobo_survey)
more_logs %>% 
  check_percentage_missing(uuid_column = "X_uuid")
  • Create logical check with the following checks. They should all be in an excel file.
    • household number (variable: num_hh_member) is above 8.
    • the water source for drinking water is bottled (variable: water_source_drinking, value: bottled) and the household always treat the drinking water (variable: treat_drink_water, value: always_treat).
    • the water source for drinking water is bottled (variable: water_source_drinking, value: bottled) and one of the main reason for the the household to not meet its water needs is the water pressure (variable: access_water_enough_why_not, value: water_pressure, this is a select multiple)
my_check_list <- data.frame(check_id = c("check_household number", "check_water_treatment", "check_3"),
                            description = c("num_hh_member is big","using bottled water and always treat","using bottled water and main reason is water pressure"),
                            check_to_perform = c("num_hh_member > 8","water_source_drinking == \"bottled\" & treat_drink_water == \"always_treat\"","water_source_drinking == \"bottled\" & access_water_enough_why_not.water_pressure == TRUE"),
                            columns_to_clean = c("num_hh_member","water_source_drinking, treat_drink_water","water_source_drinking, access_water_enough_why_not.water_pressure"))

my_raw_dataset %>% 
    check_logical_with_list(uuid_column = "X_uuid",
                          list_of_check = my_check_list,
                          check_id_column = "check_id",
                          check_to_perform_column = "check_to_perform",
                          columns_to_clean_column = "columns_to_clean",
                          description_column = "description")
  • try add_duration on your dataset There is no specific answer just now.
my_raw_dataset$new_start <- ymd_hms(my_raw_dataset[["start"]])
my_raw_dataset$new_end <- ymd_hms(my_raw_dataset[["end"]])
my_raw_dataset$duration <- difftime(as.POSIXct(my_raw_dataset$new_start), as.POSIXct(my_raw_dataset$new_end), units = "mins")